R is available for free for Windows , GNU/Linux , and macOS .
To install R, you can go to this https://cloud.r-project.org/. The latest available release is R 4.3.3 “Angel Food Cake” released on 2024-02/29, but any (fairly recent) version will do.
Install RStudio IDE
RStudio Desktop is an Integrated Development Editor (IDE), basically a graphical interface wrapping and interfacing R (which needs to be installed first).
Besides RStudio, R (which is a command line driven program) can be executed:
via its native interface (R GUI)
from many other code editors, like VS Code, Sublime Text, Jupyter Notebook
An R Project will keep all the files associated with a project (including invisible ones!) organized together – input data, R scripts, analytical results, figures. Besides being common practice, this has the advantage of implicitly setting the “working directory”, which is incredibly important when you need to load or output files, specifying their file path.
In Figure 1 you can see how easy it is just following RStudio prompts:
Create a new directory for each project
Select parent folder
Creating an R Project [in Rstudio] (cont.)
Figure 1: Creating an R project
Install R packages from CRAN (stable version)
An R package* is a shareable bundle of functions. Besides the basic built-in functions already contained in the program (i.e. the base package), many useful R functions come in free libraries of code (or packages) written by R’s users. You can find them in different repositories:
To install a package use utils function install.packages("package_name)
# Installing (ONLY the 1st time)utils::install.packages('here')# OR (same)install.packages('here')
Install R packages from GitHub (testing version)
With the package devtools and its function install_github to install the developer’s version of a package. Let’s try it with a little package paint (which colors the structure of dataset when printing).
# Installing devtools (ONLY the 1st time)utils::install.packages('devtools')# Installing paint from GitHub library(devtools)devtools::install_github("MilesMcBain/paint")# test paint outlibrary(paint)
# it will show me the structure of a data.frame like this... paint::paint(mtcars)
Install R packages RStudio pane
You can also install and update packages using the “Packages” tab on the lower right pane of RStudio.
Screenshot Install/Update pckgs from RStudio
Use R Packages
We will be using {base} & {utils} (pre-installed and pre-loaded)
We will also use the packages below (specifying package::function for clarity).
# Load them for this R sessionlibrary(here) # tools find your project's files, based on working directorylibrary(janitor) # tools for examining and cleaning datalibrary(skimr) # tools for summary statistics library(dplyr) # {tidyverse} tools for manipulating and summarising tidy data library(ggplot2) # {tidyverse} tools for plottinglibrary(forcats) # {tidyverse} tool for handling factorslibrary(ggridges) # alternative to plot density functions library(fs) # file/directory interactions
Help on R package/function
To inquire about a package and/or its functions, you can again write in your console ?package_name or ??package_name and RStudio will open up the Help page in the lower right pane.
# Opening Help page on package/function?here??here
File paths logistics
It is never good practice to “hard code” the file’s absolute path: most likely it will break your code as soon as you (or someone else) need to run it on a different computer, let alone within a different OS.
So if your code to read & load a file is written like this:
# [NOT REPRODUCIBLE] hard coding your file path -----------------------# File path on Mac:dataset <- readr::read_csv("/Users/testuser/R4biostats/input_data/dataset.csv")# Same file path on Windows:dataset <- readr::read_csv("C:\Users\testuser\R4biostats\input_data\dataset.csv")
…it won’t work on someone else’s computer since they don’t have that same file structure!
(Reproducible) file paths with here [in Rstudio]
The here package lets you reference file paths in a reproducible manner (anchored on the R Project’s folder as the root).
Where is my Working Directory?
here::here()
You should get “/Users/dir/sub_dir/proj_name”
Create a Sub-Directory with fs package (for saving input data and output data)
# with `here` I simply add subfolder names relative to my wd fs::dir_create(here("practice", "test","data_input"))# ...and a subfolder to put output files at the endfs::dir_create(here("practice", "test","data_output"))## --- [if I need to remove it (I have them already)]fs::dir_delete(here("practice", "test"))
R OBJECTS, FUNCTIONS, PACKAGES
Importing data into R workspace
We are using real data provided by Thabtah,Fadi. (2017). Autism Screening Adult. UCI Machine Learning Repository. https://doi.org/10.24432/C5F019
autism_data_url <-read.csv(file ="https://raw.githubusercontent.com/Sydney-Informatics-Hub/lessonbmc/gh-pages/_episodes_rmd/data/autism_data.csv", header =TRUE, # 1st line is the name of the variablessep =",", # which is the field separator character.na.strings =c("?") # specific values R should interpret as NA)
Option 2: from my folder (upon downloading)
# Check my working directory location# here::here()# Use `here` in specifying all the subfolders AFTER the working directory autism_data_file <-read.csv(file =here("practice", "data_input", "01_datasets", "autism_data.csv"), header =TRUE, # 1st line is the name of the variablessep =",", # which is the field separator character.na.strings =c("?"),# specific values R should interpret as NArow.names =NULL)
DATA OBSERVATION & MANIPULATION
Viewing the dataset
View(autism_data_file) # (or click on it in Enviroment tab)
# What data type is this data?class(autism_data_file)
[1] "data.frame"
# What variables are included in this dataset?base::colnames(autism_data_file)
[1] "White-European" "Latino" "Latino" "White-European"
[5] NA "Others"
Option 2 using indexing [#row, ] from base”
# Indexing to pick `[#row, ]` head(autism_pids[1 , ] ) # empty cols means all
id A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score
1 1 1 1 1 1 0 0 1 1
A9_Score A10_Score age gender ethnicity jaundice autism contry_of_res
1 0 0 26 f White-European no no United States
used_app_before result age_desc relation Class_ASD pids
1 no 6 18 and more Self NO PatientID_1
head(autism_pids[50,])
id A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score
50 50 1 1 0 0 0 1 1 1
A9_Score A10_Score age gender ethnicity jaundice autism contry_of_res
50 0 1 30 f Asian no no Bangladesh
used_app_before result age_desc relation Class_ASD pids
50 no 6 18 and more Self NO PatientID_50
head(autism_pids[25:26 ,])
id A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score
25 25 1 1 1 1 0 0 0 1
26 26 0 1 1 0 0 0 0 1
A9_Score A10_Score age gender ethnicity jaundice autism contry_of_res
25 0 0 43 m <NA> no no Lebanon
26 0 0 24 f <NA> yes no Afghanistan
used_app_before result age_desc relation Class_ASD pids
25 no 5 18 and more <NA> NO PatientID_25
26 no 3 18 and more <NA> NO PatientID_26
Option 2 using indexing [#row, #col] from base”
# Indexing to pick `[#row, #col]` autism_pids[1:3,1]
[1] 1 2 3
autism_pids[1:3,23]
[1] "PatientID_1" "PatientID_2" "PatientID_3"
autism_pids[1:3,2]
[1] 1 1 1
autism_pids[1:3,14]
[1] "White-European" "Latino" "Latino"
What are the data types of the variables?
Option 1 using base functions
# What are the data types of the variables? ---------------------------------str(autism_pids) # integer and character
#### char 2 factor -------------------------------------------------------------# Say I want to treat some variables as factorsautism_pids$gender <-as.factor(autism_pids$gender)autism_pids$ethnicity <-as.factor(autism_pids$ethnicity)autism_pids$contry_of_res <-as.factor(autism_pids$contry_of_res)autism_pids$relation <-as.factor(autism_pids$relation)# check class(autism_pids$gender)
[1] "factor"
class(autism_pids$ethnicity)
[1] "factor"
class(autism_pids$contry_of_res)
[1] "factor"
class(autism_pids$relation)
[1] "factor"
# now I have Variable type: factor
autism_pids_temp <- autism_pids # copy df for test to_factor <-c("gender", "ethnicity", "contry_of_res", "relation") # vector of col names autism_pids_temp[ ,to_factor] <-lapply(X = autism_pids[ ,to_factor], FUN = as.factor)# check class(autism_pids_temp$gender)
# observe a subset of some columns autism_subset <- autism_pids [1:5, c("gender","jaundice", "autism", "age_desc", "Class_ASD","pids")]# View(autism_subset)# recode "age_desc" as LOGICAL new var "age_desc_log"autism_pids$age_desc_log <-ifelse(autism_pids$age_desc =="18 and more", TRUE, FALSE )class(autism_pids$age_desc)
[1] "character"
class(autism_pids$age_desc_log)
[1] "logical"
from char to dummy [0,1]
I also may need my binary variables expressed as 01 (e.g. to incorporate nominal variables into regression analysis)
head(autism_pids) #return fist 6 obstail(autism_pids) #return last 6 obs
using {utils} head or tail (cont.)
head(autism_pids, n =2) #return fist 2 obs
id A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score
1 1 1 1 1 1 0 0 1 1
2 2 1 1 0 1 0 0 0 1
A9_Score A10_Score age gender ethnicity jaundice autism contry_of_res
1 0 0 26 f White-European no no United States
2 0 1 24 m Latino no yes Brazil
used_app_before result age_desc relation Class_ASD pids
1 no 6 18 and more Self NO PatientID_1
2 no 5 18 and more Self NO PatientID_2
age_desc_log autism_dummy
1 TRUE 0
2 TRUE 1
tail(autism_pids, n =2) #return last 2 obs
id A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score
703 703 1 0 0 1 1 0 1 0
704 704 1 0 1 1 1 0 1 1
A9_Score A10_Score age gender ethnicity jaundice autism contry_of_res
703 1 1 35 m South Asian no no Pakistan
704 1 1 26 f White-European no no Cyprus
used_app_before result age_desc relation Class_ASD pids
703 no 6 18 and more Self NO PatientID_703
704 no 8 18 and more Self YES PatientID_704
age_desc_log autism_dummy
703 TRUE 0
704 TRUE 0
Investigating a subset of observations
E.g. I learned that some patients have missing age… how many are they?
# run...sum(is.na(autism_pids$age))
[1] 2
# or skimr::n_missing(autism_pids$age)
[1] 2
Next, I want to ID those patients with missing age
New df (only the patients with missing age) as SUBSET of the given df
I want to extract only the obs (rows) of interest with a few useful vars (cols)
pids age autism_dummy
63 PatientID_63 NA 0
92 PatientID_92 NA 0
Subset using subset from base
# arguments allow me to specify rows and cols missing_age_subset2 <-subset(x = autism_pids, subset =is.na(autism_pids$age), # 1 logical conditionselect =c("pids", "age", "autism_dummy") # which cols ) missing_age_subset2
pids age autism_dummy
63 PatientID_63 NA 0
92 PatientID_92 NA 0
# Creates a SUBSET based on MORE conditions (`age` and `ethnicity`)subset_2cond <-subset(x = autism_pids, # 2 logical conditions subset = age <50& contry_of_res =="Brazil", # pick a few cols select =c("pids", "age", "contry_of_res","autism_dummy")) subset_2cond
pids age contry_of_res autism_dummy
2 PatientID_2 24 Brazil 1
54 PatientID_54 21 Brazil 1
94 PatientID_94 19 Brazil 1
169 PatientID_169 36 Brazil 1
170 PatientID_170 36 Brazil 1
429 PatientID_429 20 Brazil 0
587 PatientID_587 21 Brazil 0
588 PatientID_588 21 Brazil 0
696 PatientID_696 28 Brazil 0
Subset using {dplyr} filter and select
Switching to the package dplyr and embracing the “pipe” (%>%) operator logic, in which the filtering (rows) and selecting (columns) is done in sequence
## here the filtering (rows) and selecting (columns) is done in sequencetwocond_dplyr_subset <- autism_pids %>% dplyr::filter(age <50& contry_of_res =="Brazil") %>%# which rows dplyr::select (pids, age, contry_of_res, autism_dummy) # which colstwocond_dplyr_subset
pids age contry_of_res autism_dummy
1 PatientID_2 24 Brazil 1
2 PatientID_54 21 Brazil 1
3 PatientID_94 19 Brazil 1
4 PatientID_169 36 Brazil 1
5 PatientID_170 36 Brazil 1
6 PatientID_429 20 Brazil 0
7 PatientID_587 21 Brazil 0
8 PatientID_588 21 Brazil 0
9 PatientID_696 28 Brazil 0
Dealing with missing data
Input values where missing
⚠︎ WARNING ⚠︎ This is a very delicate step, because any data that is modified or imputed beyond the original collection can affect the result of subsequent analysis and statistical modeling.
Furthermore, it will be necessary to document and justify whichever approach is used to deal with missing data.
# 1/2 create a new variable autism_pids$age_inputed <- autism_pids$age# 2/2 replace value (presumably taken from other source) of `aged_inputed` # CONDITIONAL on `pids`autism_pids$age_inputed[autism_pids$pids =="PatientID_63"] <-65autism_pids$age_inputed[autism_pids$pids =="PatientID_92"] <-75# checkskimr::n_missing(autism_pids$age)
[1] 2
skimr::n_missing(autism_pids$age_inputed)
[1] 0
DESCRIPTIVE STATISTICS
Summarizing all variables
{base} summary
summary(autism_pids)
{skimr} skim
skimr::skim(autism_pids)
Notice the different treatment according to the variable type
The function’s results depend on the class of the object
integer (A1_Score)
summary(autism_pids$A1_Score) # min, max quartiles, mean, median
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 1.0000 0.7216 1.0000 1.0000
factor (ethnicity)
summary(autism_pids$ethnicity) # counts of levels' frequency (included NA!)
Asian Black Hispanic Latino Middle Eastern
123 43 13 20 92
others Others Pasifika South Asian Turkish
1 30 12 36 6
White-European NA's
233 95
Notice the different treatment according to the variable type
logical (age_desc_log)
summary(autism_pids$age_desc_log) # counts of TRUE
Mode TRUE
logical 704
Frequency tables
Frequency distributions can be used for nominal, ordinal, or interval/ration variables
# by(data$column, data$grouping_column, mean)by(data = autism_pids$age_inputed, INDICES = autism_pids$gender, FUN = mean)
autism_pids$gender: f
[1] 29.69139
------------------------------------------------------------
autism_pids$gender: m
[1] 28.98365
# i.e. apply a function to subsets of a vector or array, split by one or more factors.tapply(X = autism_pids$age_inputed, INDEX = autism_pids$gender, FUN = mean)
f_calc_mode <-function(x) { # `unique` returns a vector of unique values uni_x <-unique(x) # `match` returns the index positions of 1st vector against 2nd vector match_x <-match(x, uni_x)# `tabulate` count the occurrences of integer values in a vector. tab_x <-tabulate(match_x) # returns element of uni_x that corresponds to max occurrences uni_x[tab_x ==max(tab_x)]}
Standard deviation Sample \(s = \sqrt\frac{\sum{(x_i-\bar{x})^2}}{n-1}\) & Population \(\sigma = \sqrt\frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2} {n}\)
var(autism_pids$age)
[1] NA
var(autism_pids$age_inputed)
[1] 98.81338
sd(autism_pids$age)
[1] NA
sd(autism_pids$age_inputed)
[1] 9.940492
VISUAL DATA EXPLORATION
Plotting with ggplot2
ggplot2 provides a set of tools to map data to visual elements on a plot, to specify the kind of plot you want, and then subsequently to control the fine details of how it will be displayed. It basically allows to build a plot layer by layer (Figure 2).
data -> specify what our dataset is
aesthetic mappings (or just aesthetics) -> specify which dataset’s variables will turn into the plot elements (e.g. \(x\) and \(y\) values, or categorical variable into colors, points, and shapes).
geom -> the overall type of plot, e.g. geom_point() makes scatterplots, geom_bar() makes barplots, geom_boxplot() makes boxplots.
# notice the `%>%` before using ggplot2...autism_pids %>%# then `+` when using ggplot2ggplot(aes(x = age_inputed )) +geom_histogram() +theme_bw()
… or (piped data)
… bin width
Histograms split the data into ranges (bins) and show the number of observations in each. Hence, it’s important to pick widths that represents the data well.
autism_pids %>%ggplot(aes(x = age_inputed )) +# specify to avoid warning if we fail to specify the number of bins geom_histogram(bins=40) +theme_bw()
… bin width
… mean vertical line
autism_pids %>%ggplot(aes(x = age_inputed )) +# specify to avoid warning if we fail to specify the number of bins geom_histogram(bins=40) +# add mean vertical linegeom_vline(xintercept =mean(autism_pids$age_inputed),na.rm =FALSE,lwd=1,linetype=2,color="#9b2339") +# add small annotations (such as text labels) annotate("text", # coordinates for positioning aesthetics on the graphx =mean(autism_pids$age_inputed) *1.4,y =mean(autism_pids$age_inputed) *1.7,label =paste("Mean =", round(mean(autism_pids$age_inputed), digits =2)),col ="#9b2339",size =4)+theme_bw()
# trying to improve readability autism_pids %>%ggplot(mapping =aes(x = age_inputed, fill = gender )) +# bars next to each other with `position = 'dodge'`geom_histogram(bins=40, position ='dodge') +theme_bw()
… shifting bars by group
…facet by gender
That’s still not very easy to digest. Instead of only filling, you can separate the data into multiple plots to improve readability
autism_pids %>%ggplot(aes(x = age_inputed, fill = gender )) +geom_histogram(color="#e9ecef", alpha=0.8, position ='dodge') +theme_bw() +# splitting the gender groups, specifying `ncol` to see one above the otherfacet_wrap(~gender, ncol =1) +scale_fill_cyclical(values =c("#9b2339","#005ca1"))
…facet by gender
… adding 2 mean vert lines (by gender)
I want to see the mean vertical line for each of the subgroups, but now, I need to create a small dataframae of summary statistics.
I do so by using dplyr add a column mean_age with the group mean
# A tibble: 4 × 4
gender Stat Value label
<fct> <chr> <dbl> <chr>
1 f mean_age 29.7 f_mean_age
2 f median_age 28 f_median_age
3 m mean_age 29.0 m_mean_age
4 m median_age 26 m_median_age
…facet by gender + 2 mean vert lines + scales
hist_plot <- autism_pids %>%ggplot(aes(x = age_inputed, fill = gender)) +geom_histogram(bins=30,color="#e9ecef", alpha=0.8, position ='dodge') +facet_wrap(~gender, ncol =1 ) +scale_fill_manual(values =c("#9b2339","#005ca1")) +# adding vline geom_vline(data = group_stats_long, mapping =aes(xintercept = Value, color = Stat),lwd=1.5,linetype=6, ) +labs(x ="age brackets", y ="n of individuals",color ="Stats",title ="Distribution of observations by gender",subtitle ="",caption ="Autism study") +theme_bw() +theme(legend.position ="right",plot.title =element_text(face ="bold")) +# increase number of x axis ticks scale_color_manual(values =c( "#e68000", "#d8cf71")) +scale_x_continuous(breaks =seq(10, 100,10 ), limits =c(16, 86))hist_plot
…facet by gender + 2 mean vert lines + scales
Density ggridges package
As an alternative, you can use the {ggridges} package to make ridge plots. The geom geom_density_ridges calculates density estimates from the provided data and then plots those, using the ridgeline visualization. In this case I have added the median line.
# library(ggridges)autism_pids %>%# this takes also `y` = groupggplot(aes(x=age_inputed, y = gender, fill = gender)) + ggridges::geom_density_ridges() +# I can add quantile lines (2 is the median)stat_density_ridges(quantile_lines =TRUE, quantiles =c(0.5), alpha =0.75)+# increase number of x axis ticks scale_x_continuous(breaks =seq(10, 100,10 ), limits =c(16, 86)) +scale_fill_cyclical(values =c("#9b2339","#005ca1")) +theme_bw()
Density ggridges package
Barchart
Bar charts provide a visual presentation of categorical data, with geom_bar() (height of the bar proportional to the number of cases in each group)
# Let's take a variable that we recoded as `factor`class(autism_pids$ethnicity)
autism_pids %>%ggplot(aes(x = ethnicity )) +geom_bar(fill ="steelblue") +# reference line geom_hline(yintercept=100, color ="#9b2339", size=0.5, ) +labs(x ="ethnicity", y ="n of individuals",color ="Stats",title ="Distribution of observations by ethnicity",subtitle ="",caption ="Autism study") +# --- wrap long x labels (flipped ) !!!# scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 10)) +theme_bw() +theme(axis.text.x =element_text(angle=50, vjust=0.75), axis.text.y =element_text(size=10,face="bold"))
…improve theme
…improve readability (reorder)
Reordering the bars by count using the package forcats and its function fct_infreq + (which we can do because ethnicity is coded as factor)
autism_pids %>%# we modify our x like so ggplot(aes(x = forcats::fct_infreq(ethnicity ))) +geom_bar(fill ="steelblue") +geom_hline(yintercept=100, color ="#9b2339", size=0.5, ) +labs(x ="ethnicity", y ="n of individuals",color ="Stats",title ="Distribution of observations by ethnicity",subtitle ="",caption ="Autism study") +# --- wrap long x labels (flipped ) !!!# scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 10)) +theme_bw() +theme(axis.text.x =element_text(angle=50, vjust=0.75), axis.text.y =element_text(size=10,face="bold"))
…improve readability (reorder)
…improve readability (highlight NA)
Let’s highlight the fact that the last column is the number of missing values.
# Highlight "NA" as separate categoryautism_pids %>%## --- prep the dataframe # Add a factor variable with two levels to use as fill/highlight dplyr::mutate(highlight = forcats::fct_other( ethnicity, keep ="NA", other_level ="All Groups")) %>%## --- ggplot # we ADD to `aes mapping` also `fill = highlight`ggplot(aes(x = forcats::fct_infreq(ethnicity), fill = highlight)) +geom_bar()+# Use custom color palettesscale_fill_manual(values=c("#0084e6")) +# Add a line at a significant level geom_hline(yintercept=100, color ="#9b2339", size=0.5, ) +labs(x ="ethnicity", y ="n of individuals",color ="Stats",title ="Distribution of observations by ethnicity",subtitle ="",caption ="Autism study") +# --- wrap long x labels (flipped ) !!!# scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 10)) +theme_bw() +theme(axis.text.x =element_text(angle=50, vjust=0.75), axis.text.y =element_text(size=10,face="bold")) +## drop legend and Y-axis titletheme(legend.position ="none")
…improve readability (highlight NA)
Boxplot
As discussed in lecture 1, the boxplot is packed with information about a distribution.